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Abstract 

The design of high-precision sensing devises becomes ever more difficult and expensive. At the 
same time, the need for precise calibration of these devices (ranging from tiny sensors to space tele¬ 
scopes) manifests itself as a major roadblock in many scientific and technological endeavors. To achieve 
optimal performance of advanced high-performance sensors one must carefully calibrate them, which 
is often difficult or even impossible to do in practice. In this work we bring together three seemingly 
unrelated concepts, namely Self-Calibration, Compressive Sensing, and Biconvex Optimization. The 
idea behind self-calibration is to equip a hardware device with a smart algorithm that can compensate 
automatically for the lack of calibration. We show how several self-calibration problems can be treated 
efficiently within the framework of biconvex compressive sensing via a new method called SparseLift. 
More specifically, we consider a linear system of equations y = DAx, where both x and the diagonal 
matrix D (which models the calibration error) are unknown. By “lifting” this biconvex inverse prob¬ 
lem we arrive at a convex optimization problem. By exploiting sparsity in the signal model, we derive 
explicit theoretical guarantees under which both x and D can be recovered exactly, robustly, and 
numerically efficiently via linear programming. Applications in array calibration and wireless commu¬ 
nications are discussed and numerical simulations are presented, confirming and complementing our 
theoretical analysis. 


1 Introduction 

The design of high-precision sensing devises becomes ever more difficult and expensive. Meanwhile, the 
need for precise calibration of these devices (ranging from tiny sensors to space telescopes) manifests itself 
as a major roadblock in many scientific and technological endeavors. To achieve optimal performance 
of advanced high-performance sensors one must carefully calibrate them, which is often difficult or even 
impossible to do in practice. It is therefore necessary to build into the sensors and systems the capability 
of self-calibration - using the information collected by the system to simultaneously adjust the calibration 
parameters and perform the intended function of the system (image reconstruction, signal estimation, tar¬ 
get detection, etc.). This is often a challenging problem requiring advanced mathematical solutions. Given 
the importance of this problem there has been a substantial amount of prior work in this area. Current 
self-calibration algorithms are based on joint estimation of the parameters of interest and the calibration 
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parameters using standard techniques such as maximum likelihood estimation. These algorithms tend to 
be very computationally expensive and can therefore be used only in relatively simple situations. This has 
prevented their widespread applications in advanced sensors. 

In this paper we take a step toward developing a new framework for self-calibration by bringing together 
three seemingly unrelated concepts, namely Self-Calibration, Compressive Sensing, and Biconvex Opti¬ 
mization. By “lifting” the problem and exploiting sparsity in the signal model and/or the calibration 
model we express the biconvex problem of self-calibration as a convex optimization problem, which can 
be solved efficiently via linear programming. This new method, called SparseLift , ensures equivalence 
between the biconvex problem and the convex problem at the price of a modest increase in the number of 
measurements compared to the already perfectly calibrated problem. SparseLift is inspired by PhaseLift, 
an algorithm for phase retrieval da Ei, and by the blind deconvolution framework via convex programming 
due to Ahmed, Recht, and Romberg [Tj. 

Self-calibration is a field of research in its own right, and we do not attempt in this paper to develop a 
framework that can deal with any kind of self-calibration problems. Instead our goal is more modest—we 
will consider a special, but at the same time in practice very important, class of self-calibration problems. 
To that end we first note that many self-calibration problems can be expressed as (either directly or after 
some modification such as appropriate discretization) 

y = A(h)x + w, (1.1) 

where y represents the given measurements, A(h) is the system matrix depending on some unknown 
calibration parameters h and x is the signal (function, image, ...) one aims to recover and w is additive 
noise. As pointed out, we do not know A(h) exactly due to the lack of calibration. Our goal is to recover 
x and h from y. More precisely, we want to solve 

min \\A{h)x - y\\ 2 . (1.2) 

h,x 

If A(h) depends linearly on h , then (II.2ft is a bilinear inverse problem. In any case, the optimization 
problem (11.21) is not convex, which makes its numerical solution rather challenging. 

In many applications (11.ip represents an underdetermined system even if A(h) is perfectly known. For¬ 
tunately, we can often assume x to be sparse (perhaps after expressing it in a suitable basis, such as a 
wavelet basis). Moreover, in some applications h is also sparse or belongs to a low-dimensional subspace. 
Hence this suggests considering 

min || A(h)x - y || 2 + f(x, h), (1.3) 

h,x 

where f(x, h ) is a convex function that promotes sparsity in x and/or h. Often it is reasonable to assume 
that (II. 3p is biconvex, i.e., it is convex in h for fixed x and convex in x for fixed h. In other cases, one 
can find a close convex relaxation to the problem. We can thus frequently interpret (1 1.3 j) as a biconvex 
compressive sensing problem. 

Yet, useful as (11.3 j) may be, it is still too difficult and too general to develop a rigorous and efficient 
numerical framework for it. Instead we focus on the important special case 

y = D(h)Ax + w, (1.4) 

where D(h ) is a diagonal matrix depending on the unknown parameters h and A is known. Of particular 
interest is the case where x is sparse and the diagonal entries of D belong to some known subspace. Hence, 
suppose we are given 

y = DAx + w , D — diag(Bh), (1.5) 
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where y G C ixl , D : C LxL , h G C fcxl , A G C LxN , B G c ixfc and a: G C JVxl is n-sparse. D = diag (Bh) can 
be interpreted as the unknown calibrating parameters lying in the range of a linear operator B. Indeed, 
this model will be the main focus of our investigation in this paper. More examples and explanations of 
the model can be found in the next two subsections as well as in Sections 16.31 and 16.41 

Before moving on to applications, we pause for a moment to discuss the issue of identihability. It is clear 
that if (D, x) is a pair of solutions to (II.4p . then so is (ctD, -x) for any non-zero a. Therefore, it is only 
possible to uniquely reconstruct D and x up to a scalar. Fortunately for most applications this is not an 
issue, and we will henceforth ignore this scalar ambiguity when we talk about recovering D and x. 

The reader may think that with the model in (II .4(1 we finally have reached a level of simplicity that is trivial 
to analyze mathematically and no longer useful in practice. However, as is often the case in mathematics, a 
seemingly simple model can be deceptive. On the one hand a rigorous analysis of the diagonal calibration 
model above requires quite non-trivial mathematical tools, and on the other hand this “simple” model 
does arise in numerous important applications. Perhaps the most widely known instance of (II .4(1 is blind 
deconvolution , see e.g. [I, [29]. Other applications include radar, wireless communications, geolocation, 
direction finding, X-ray crystallography, cryo-electron microscopy, astronomy, medical imaging, etc. We 
briefly discuss two such applications in more detail below. 


1.1 Array self-calibration and direction-of-arrival estimation 

Consider the problem of direction-of-arrival estimation using an array with L antennas. Assuming that 
n signals are impinging on the array, the output vector (using the standard narrowband model) is given 
by m 

n 

y{t) = ^2 D(h)a(9 k , ipk)x k (t) + w(t). (1.6) 

k =1 

The (unknown) calibration of the sensors is captured by the L x L matrix D. By discretizing the ar¬ 
ray manifold a{6k,'4>k)i where (0,VO are the azimuth and elevation angles, respectively, we can arrange 
equation (1 1.6 j) as (see [20] ) 

y = D(h)Ax + w, (1.7) 

where w now also includes discretization error. Here a(9, ijj) is a completely known function of ( 9 , i)j) and 
does not depend on calibration parameters, so A is a known matrix. 

To give a concrete example, assuming for simplicity all sensors and signal sources lie in the same plane, 
for a circular array with uniformly spaced antennas, the fc-th column of the L x N matrix A is given 
by {e-^Hvk^yL-i with Vk _ [ s i n (0 fc ) cos (9 k )] T ,Uj = 2sin ^ /L) [sin(^) cos(^)] T ,where -f < 9 0 < 
■ ■ ■ < 9n -i > ^ is a discretization of the angular domain (the possible directions of arrival), the entries of 
u = o represent the geometry of the sensor array, L is the number of sensors, and A is the array 

element spacing in wavelengths. 

An important special case is when D(h ) is a diagonal matrix whose diagonal elements represent unknown 
complex gains associated with each of the antennas. The calibrating issue may come from position offsets 
of each antenna and/or from gain discrepancies caused by changes in the temperature and humidity of 
the environment. If the true signal directions (di, ^i),..., (9 n , ^ n ) lie on the discretized grid of angles 
($i, Vh) • • • , (9n,4>n) corresponding to the columns of A, then the N x 1 vector x is n-sparse, otherwise x 
is only approximately n-sparse. In the direction finding problem we usually have multiple snapshots of the 
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measurements y(t), in which case y and x in (JTTTj) become matrices, whose columns are associated with 
different snapshots [2D]- 

1.2 The Internet of Things, random access, and sporadic traffic 

The Internet of Things (IoT) will connect billions of wireless devices, which is far more than the current 
Fourth-Generation (4G) wireless system can technically and economically accommodate. One of the many 
challenges of 5G, the future and yet-to-be-designed wireless communication system, will be its ability to 
manage the massive number of sporadic traffic generating IoT devices which are most of the time inactive, 
but regularly access the network for minor updates with no human interaction |0D]. Sporadic traffic will 
dramatically increase in the 5G market and it is a common understanding among communications engineers 
that this traffic cannot be handled within the current 4G random access procedures. Dimensioning the 
channel access according to classical information and communication theory results in a severe waste of 
resources which does not scale towards the requirements of the IoT. This means among others that the 
overhead caused by the exchange of certain types of information between transmitter and receiver, such as 
channel estimation, assignment of data slots, etc, has to be avoided. In a nutshell, if there is a huge number 
of devices each with the purpose to sporadically transmit a few bits of data, we must avoid transmitting 
each time a signal where the part carrying the overhead information is much longer than the part carrying 
the actual data. The question is whether this is possible. 

In mathematical terms we are dealing with the following problem (somewhat simplified, but it captures 
the essence). For the purpose of exposition we consider the single user/single antenna case. Let h denote 
the channel impulse response and x the signal to be transmitted, where a? is a sparse signal, its non-zero 
coefficients containing the few data that need to be transmitted. We encode x by computing z = Ax, 
where A is a fixed precoding matrix, known to the transmitter and the receiver. However, the non-zero 
locations of x are not known to the receiver, and nor is h, since this is the overhead information that we 
want to avoid transmitting, as it will change from transmission to transmission. The received signal is given 
by the convolution of h and z , i.e., y = h*z. The important point here is that the receiver needs to recover 
x from y, but does not know h either. This type of problem is often referred to as blind deconvolution. 
After applying the Fourier transform on both sides and writing D = diag(fa) (where ~ denotes the Fourier 
transform), we obtain y = DAx , which is of the form III.4)1 modulo a change of notation. A more detailed 
example with concrete choices for A and D is given in Section 16.41 

1.3 Notation and outline 

Matrices are written in uppercase boldface letters (e.g. X), while vectors are written in lowercase boldface 
letters (e.g. x), and scalars are written in regular font or denoted by Greek letters. For a matrix X with 
entries X t] we denote ||AC||! := JA. \Xij\, ||X||* is the nuclear norm of X, i.e., the sum of its singular 
values; ||.X’|| 00 = max^ |Xyj, ||W|| is the operator norm of X and ||X||^ is its Frobenius norm. Tr(X) 
denotes the trace of X. If x is a vector, ||a?|| is its 2-norm and ||a;||o denotes the number of non-zero entries 
in x, and x denotes the complex conjugate of x. 

The remainder of the paper is organized as follows. In Section [2] we briefly discuss our contributions in 
relation to the state of the art in self-calibration. Section [3] contains the derivation of our approach and 
the main theorems of this paper. Sections [4] and [5] are dedicated to the proofs of the main theorems. We 
present numerical simulations in Section [Tj] and conclude in Section [7] with some open problems. 
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2 Related work and our contributions 


Convex optimization combined with sparsity has become a powerful tool; it provides solutions to many 
problems which were believed intractable in the fields of science, engineering and technology. Two re¬ 
markable examples are compressive sensing and matrix completion. Compressive sensing HH H31 H9] , an 
ingenious way to efficiently sample sparse signals, has made great contributions to signal processing and 
imaging science. Meanwhile, the study of how to recover a low-rank matrix from a seemingly incomplete 
set of linear observations has attracted a lot of attention, see e.g. [51 [71 [33]. 

Besides drawing from general ideas in compressive sensing and matrix completion, the research presented 
in this work is mostly influenced by papers related to PhaseLift ®mm and in particular by the very 
inspiring paper on blind deconvolution by Ahmed, Recht, and Romberg [T|. PhaseLift provides a strategy of 
how to reconstruct a signal x from its quadratic measurements i/i = \(x, Zi )| 2 via “lifting” techniques. Here 
the key idea is to lift a vector-valued quadratic problem to a matrix-valued linear problem. Specifically, 
we need to find a rank-one positive semi-definite matrix X = xx* which satisfies linear measurement 
equations of X: y { = Tr(XZ,) where Z % = ZiZ*. Generally problems of this form are NP-hard. However, 
solving the following nuclear norm minimization yields exact recovery of xx* : 

min||X||*, subject to yi = Tr(XZj), X >z 0. (2.1) 

This idea of ” Lifting” also applies to blind deconvolution [T], which refers to the recovery of two unknown 
signals from their convolution. The model can be converted into the form (13.1|) via applying the Fourier 
transform and under proper assumptions X 0 = h 0 Xy can be recovered exactly by solving a nuclear norm 
minimization program in the following form, 

min ||X||*, subject to yi = Tr(JCAj), A, is a rank-one matrix, (2-2) 

if the number of measurements is at least larger than the dimensions of both h and x. Compared with 
PhaseLift (12. ip . the difference is that X can be asymmetric and there is of course no guarantee for 
positivity. Ahmed, Recht, and Romberg derived a very appealing framework for blind deconvolution via 
convex programming, and we borrow many of their ideas for our proofs. 

The idea of our work is motivated by the two examples mentioned above and our results are related to 
and inspired by those of Ahmed, Recht, and Romberg. However, there are important differences in our 
setting. First, we consider the case where one of the signals is sparse. It seems unnecessary to use as many 
measurements as the signal length and solve (12.2[i in that case. Indeed, as our results show we can take 
much fewer measurements in such a situation. This means we end up with a system of equations that is 
underdetermined even in the perfectly calibrated case (unlike pQ), which forces us to exploit sparsity to 
make the problem solvable. Thus [I] deals with a bilinear optimization problem, while we end up with a 
biconvex optimization problem. Second, the matrix A in [T] is restricted to be a Gaussian random matrix, 
whereas in our case A can be a random Fourier matrix as well. This additional flexibility is relevant in 
real applications. For instance a practical realization of the application in 5G wireless communications as 
discussed in Section [Inl and Section ItuTl would never involve a Gaussian random matrix as spreading/coding 
matrix, but it could definitely incorporate a random Fourier matrix. 

Bilinear compressive sensing problems have also been investigated in e.g. [16] [38] [28]. These papers focus 
on the issue of injectivity and the principle of identihability. Except for the general setup, there is little 
overlap with our work. Self-calibration algorithms have been around in the literature for quite some time, 
see e.g. EH 133121 El] for a small selection, with blind deconvolution being a prominent special case, cf. [29] 
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and the references therein. It is also worth noting that [23, 3][33j have developed self-calibration algorithms 
by using sparsity, convex optimization, and/or ideas from compressive sensing. In particular, [23, E] give 
many motivating examples of self-calibration and propose similar methods by using the lifting trick and 
minimizing || • ||i or || • ||i + A|| • ||* to exploit sparsity and/or low-rank property. While these algorithms 
are definitely useful and interesting and give many numerical results, most of them do not provide any 
guarantees of recovery. To the best of our knowledge, our paper is the first work to provide theoretical 
guarantee for recovery after bringing self-calibration problems into the framework of biconvex compressive 
sensing. Uncertainties in the sensing matrix have been analyzed for instance in [261 [To] , but with the focus 
on quantifying the uncertainty, while our current work can be seen as a means to mitigate the uncertainty 
in the sensing matrix. 


3 Self-calibration, biconvex optimization, and SparseLift 

As discussed in the introduction, we are concerned with y = D(h)Ax + in, where the diagonal matrix D 
and x are unknown. Clearly, without further information it is impossible to recover x (or D) from y since 
the number of unknowns is larger than the number of measurements. What comes to our rescue is the fact 
that in many applications additional knowledge about D and x is available. Of particular interest is the 
case where x is sparse and the diagonal entries of D belong to some known subspace. 

Hence, suppose we are given 

y = DAx + w, D = diag(-Bfa), (3-1) 

where y G C Lxl , D : C LxL , h G C fcxl , A G C LxN , B G C Lxfc and x G C Nxl is n-sparse. E.g., B may be a 
matrix consisting of the first k columns of the DFT (Discrete Fourier Transform) matrix which represents a 
scenario where the diagonal entries in D are slowly varying from entry to entry. One may easily formulate 
the following optimization problem to recover h and x. 

min11 diag (Bh)Ax — y\\ 2 . (3.2) 

h,x 

We can even consider adding an 0-component in order to promote sparsity in x, 

min|| diag(Bh)Ax — y\\ 2 + A||cc|[i, A > 0. (3.3) 

h,x 

However, both programs above are nonconvex. For (13.21) . it is an optimization with a bilinear objective 
function because it is linear with respect to one vector when the other is fixed. For (13.31) . the objective 
function is not even bilinear but biconvex [22]. 


3.1 SparseLift 

To combat the difficulties arising from the bilinearity of the objective function in (13.21) . we apply a novel 
idea by “lifting” the problem to its feasible set BE! . It leads to a problem of recovering a rank-one 
matrix hx T instead of finding two unknown vectors. Denote by a, the i-th column of A T and by b L the 
f-th column of B*. A little linear algebra yields 


Vi = ( Bh)iX T ai = b*hx T ai. 


(3.4) 
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Let X := hx T and define the linear operator A : c kxN —> C L , 


y = A(X) := {b*Xa,i}f =1 

Now the problem is to find a rank-one matrix X satisfying the linear constraint A(X) = A(X 0 ) where 
X 0 = h 0 x q is the true matrix. A popular method is to take advantage of the rank-one property and use 
nuclear norm minimization. 


min||X||*, subject to A(X) = A(X 0 ) 
where ||X||* denotes the nuclear norm of X which is the sum of singular values. 

Before we proceed, we need a bit of preparations. By using the inner product defined on c kxN as (X, Z) := 
Tr (XZ*), the adjoint operator A*(u) : C L — > c kxN of A{X) = {b*Xai}f =1 and A*A(X) have their forms 
as 

L L 

A*(u ) := ^ uibia *, A*A(X) = ^ bib*Xaia*. 

1=1 1=1 

We also need a matrix representation f :f x kN of A : A(X) = {tfXai}f =1 such that 

$vec(X) = vec(*A(X)), (3.5) 

vec(X) reshapes matrix X into a long column vector in lexicographic order. For each /, b*Xai = (a/ <g) 
bi) T vec(X) follows from the property of Kronecker product, where 6/ is the complex conjugate of bi. It 
naturally leads to the block form of $*, 

= [0i, ■ ■ ■ , 0i, ■ ■ • , 0 lL L =i , 4>i = ai®bi : kN x 1. (3.6) 

With all those notations in place, the following formula holds 

L L 

vec(.A*.A(X)) = vec(X) = 0*0* vec(X) = [(«;«*) ® (W] vec(X). 

i=i i=i 


$ also has an explicit representation of its columns. Denote 

0tj : = diag(6j)5j, i = 1, 2, ■ ■ • , k and j = 1, 2, ■ ■ • , N. (3.7) 

bi and a,- are the i- th and j-th column of B and A respectively. Both bi and a,j are of the size Lx 1. 
Then we can write into 


$ 


0i,i,''' , 0fc,i,''' ,01, 


N , 


, 0fc. 


TV 


kNxL 


Assume we pick a subset T p C {1,2, ••• , L} with |r p | = Q, we define A P (X) := {b* l Xai}i G r p from 
C kxN c Q and ^ P (X) := £ /gr btfXcua*. A p also has its matrix representation : Q x /cX 
as 

= [0i,-■■ ,0n-•• ,0 lL g iv (3- 8 ) 

such that vec(.Ap(X)) = <E>pVec(X) and vec(.A*.Ap(X)) = $*$ p vec(X). 
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Concerning the support of x 0l X 0 and vec(X 0 ), we can assume without loss of generality that the support 
of Xq is its first n entries and thus X 0 := h^x^ have its first n columns nonzero. By a slight abuse of 
notation, we write fl for the supports of x 0 , X 0 and vec(X 0 ). Let x^ and Xq denote the orthogonal 
projections of x and X on fl, respectively. Define An and *4.^ as the restriction of A and A* onto D, 

L 

MX) = WXtimyh = W-Xna, A‘ n (u) = J2 nib,a) a (3.9) 

(=1 

One can naturally define : L x kN with non-zero columns {<Pij}i<i<k,i<j<n satisfying $nvec(X) = 
An(X) and : LxkN with non-zero columns {<Pi,j}i<i<k,j>n such that = vec(J2f=i uibia* n± ). 

Even if we stipulate that the degrees of freedom in D is smaller than L by assuming that the diagonal 
entries of D belong to some subspace, without any further knowledge about x the self-calibration problem 
is not solvable, since the number of unknowns exceeds the number of measurements. Fortunately, in many 
cases we can assume that x is sparse. Therefore, the matrix X = hx T is not only of rank one but also 
sparse. A common method to recover X from y is to use a linear combination of || • ||* and || • ||i as an 
objective function (See [30] for an example in phase retrieval) and solve the following convex program 

min ||X||* + A||X||i, A > 0, subject to A(X) = A(Xq). (3.10) 

(I3.10P tries to impose both sparsity and rank-one property of a matrix. While (I3.10P seems a natural way 
to proceed, we will choose a somewhat different optimization problem for two reasons. First, it may not 
be straightforward to find a proper value for A, see also Figures [ 2 ] and [3] in Section [ 6 ] Second, in [32] it is 
shown that if we perform convex optimization by combining the two norms as in (13.101) . then we can do no 
better, order-wise, than an algorithm that exploits only one of the two structures. Since only minimizing 
|| • ||* requires L = 0(N + k ) (which can be too large for large N ) and there also holds ||-Z||i > \\Z\\^ for any 
matrix Z, both of them suggest that perhaps in certain cases it may suffice to simply try to minimize ||X||i, 
which in turn may already yield a solution with sparse structure and small nuclear norm. Indeed, as well 
will demonstrate, A-minimization performs well enough to recover X 0 exactly under certain conditions 
although it seemingly fails to take advantage of the rank-one property. Lienee we consider 

minUXll!, subject to A(X) = A(X 0 ). (3-11) 

This idea first lifts the recovery problem of two unknown vectors to a matrix-valued problem and then 
exploits sparsity through fj-minimization. We will refer to this method as SparseLift. It takes advan¬ 
tage of the “linearization” via lifting, while maintaining the efficiency of compressive sensing and linear 
programming. 

If the observed data are noisy, i.e., y = Al(X 0 ) + w with ||in|| < r), we take the following alternative convex 
program 

minUXll!, subject to ||*4.(X) — Al(X 0 )|| < 77 . (3.12) 

It is useful to rewrite (13. lip in matrix-vector form. Noticing that A(V) = thu for v = vec(V), V G c kxN 
and v G c kNxl , we have the following equivalent formulation of (13.111) 

min ll'uHi, subject to &v = (3.13) 

where v 0 := vec(X 0 ). The matrix form of (13.121) is obtained similarly, 

min ||f||i, subject to ||$d — y\\ < y. 

This model attempts to recover X 0 from noisy observation y, where y = t&'Uo + w and ||in|| < y. 


(3.14) 














3.2 Main results 


Our main finding is that SparseLift enables self-calibration if we are willing to take a few more additional 
measurements. Furthermore, SparseLift is robust in presence of additive noise. In our theory we consider 
two choices for the matrix A: in the one case A is a Gaussian random matrix; in the other case A consists 
of rows chosen uniformly at random with replacement from the N x N Discrete Fourier Transform (DFT) 
matrix F with F*F = FF* = NIn- In the latter case, we simply refer to A as a random Fourier matrix. 
Note that we normalize F to F*F = NIn simply because E(aa*) — In where a is a column of A*. 
Theorem 3.1. Consider the model (13. ip and assume that B is an L x k tall matrix with B*B = Ik, h 
is a k x 1 dense vector and x is an n-sparse and N x 1 vector. Then the solution X to (13.lip is equal to 
h 0 XQ with probability at least 1 — 0(L ~ a+1 ) if 

1. A is an L x N real Gaussian random matrix with L < N with each entry A tJ ~ W(0,1) and 

T~TT — C a fJ^ aax knlog(kN). (3.15) 

log L 

2. A is an L x N random Fourier matrix with L < N and 

L — PQ, P> 1 ° s( ) v/ ?" 7 ) , Q > C„/4 m *n(log(i) + log(WV)) (3.16) 

log 2 

where 7 = V2N{\og{2kN ) + 1) + 1. 

In both cases C a > 0 is growing linearly with a and /i rnax is defined as the largest absolute value in the 
matrix VLB, i.e., 

Pmax := max Vl\B,j\. 
d 

Remarks: We note that maxi<;<£ ||h;|| 2 < 1 < anc l hmax > 1. The definition of /i max 

seems strict but it contains a lot of useful and common examples. For example, a matrix B has /x max = 1 
if each column is picked from an L x L unitary DFT matrix and B B = I k . We expect the magnitude of 
each entry in B does not vary too much from entry to entry and thus the rows of B are incoherent. 
Theorem 3.2. If the observed y is contaminated by noise, namely, y = A(Xo) + w with ||io|| < y, the 
solution X given by \3.12\) has the following error bound. 

1. If A is a Gaussian random matrix, 

\\X-X 0 \\ F < {Co + CxVkVy 

with probability of success at least 1 — 0(L~ a+1 ). L satisfies (13.151) and both Co and C\ are both 
constants. 

2. If A is a random Fourier matrix, 


||X - X 0 ||f < (C 0 + C\VPVkn)r\ 

with probability of success at least 1 — 0(L~ a+1 ). L and P satisfy (13.16P and both C 0 and C\ are both 
constants 
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The proof of the theorem above is introduced in Section [5j The solution X obtained by SparseLift, see 
(13. lip or (I3.12p for noisy data, is not necessarily rank-one. Therefore, we compute the singular value 
decomposition of X and set h = y r o\u\ and x = \fo\v\, where by is the leading singular value and u\ 
and v± are the leading left and right singular vectors, respectively. 

Corollary 3.3. Let aiiiivl be the best Frobenius norm approximation to X, then there exists a scalar «o 
and a constant Co such that 

\\h 0 -a 0 h\\ < c 0 min(e/||/i 0 ||, ||Jio||), ||£C 0 - ay 1 x\\ < c 0 min(e/||cco||, ||#o||) 

where e = || A — X 0 ||f- 

hum also derived error bounds of recovering the signals from the recovered ” lifted” matrix by using the 
leading eigenvector or leading left and right singular vectors. 01 focus on PhaseLift and the recovered 
vector from X is unique up to a unit-normed scalar. Our corollary is more similar to Corollary 1 in pjj 
because both of them deal with a bilinear system instead of a quadratic system like PhaseLift. There are 
two signals to recover from X in our case and thus they are unique up to a scalar «o- Without prior 
information, no theoretical estimate can be given on how large cy is. 

The proof of Corollary 13.31 consists of considering the leading eigenvalue and eigenvector of the Hermitian 
matrices X X and XX and then using the sin-0 Theorem in [l7j. The same idea is implemented in the 
proof of Theorem 1.2 from [12] where the detailed reasoning can be found. 

We could use a mixed ^ 2 -^ 1 -norm in (13. lip to enforce column-sparsity in X , i.e., replace ||X||i by ||^C|| 2 ,i- 
Indeed, analogous versions of our main results, Theorem 13.11 and Theorem 13.21 hold for the mixed 
norm case as well, which will be reported in a forthcoming paper. 


4 Proof of Theorem 13.1 

The outline of the proof of Theorem 13.11 follows a meanwhile well-established pattern in the literature of 
compressive sensing mm and low-rank matrix recovery, but as is so often the case, the difficulty lies 
in the technical details. We first derive sufficient conditions for exact recovery. The fulfillment of these 
conditions involves computing norms of certain random matrices as well as the construction of a dual 
certificate. The main part of the proof is dedicated to these two steps. While the overall path of the proof 
is similar for the case when A is a Gaussian random matrix or a random Fourier matrix, there are several 
individual steps that are quite different. 

4.1 A sufficient condition for exact recovery 

We will first give a sufficient condition for exact recovery by using SparseLift. Then we prove all the 
proposed assumptions are valid. 

Proposition 4.1. The rank-1 matrix X 0 = h^xff with support is the unique minimizer to SparseLift 

dm]) if 

(sgn(-Xo), H) + || JTq_l||i > 0 (4.1) 

for any H e Ker(A)\{0}. 

This proposition can be regarded as a sufficient condition of exact recovery of a sparse vector via i\ 
minimization in compressive sensing (See Theorem 4.26(a) in [H5]. They are actually the same but appear 
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in different formulations). Based on (14. ip . we propose an alternative sufficient condition which is much 
easier to use. 

Proposition 4.2. X 0 is the unique minimizer if there exists a Y e Range(A*) satisfying the following 
conditions, 

||sgn(X 0 ) " ll*V||oo (4.2) 

and the linear operator A satisfies 

— In || < ^, ||-4|| < 7 . (4.3) 

Remark: We will later specify 7 in (14.211) and (14.221) . 

Proof: The proof starts with (14.ip . Note that (sgn(X 0 ), H) = (sgn(X 0 ) — Y , H) if Y G Range(7l*) 
and H e Ker(*4.)\{0}. By splitting the inner product of (sgn(X 0 ) — Y,H) on 0 and fl -1 , the following 
expression can guarantee the exact recovery of l\ minimization, 

(sgn(Xo) - Y n ,H n ) ~ (Y n x 7 H a ±) + ||if^||i > 0 . 


By using the Holder inequality, we arrive at the stronger condition 

— || sgn(X 0 ) — l^n||ir||-ffa||F + (l — ||^f 2 - L ||oo)||-Hn- L l|i > 0 . (4.4) 

Due to H-An-An - In || < § and ||.A|| < 7 , we have ||.4.(-Hn)|| F > ^||i2n||F and ||„4(iT n x)||F < 7||i?n-L ||f- 
As a result, 

^ll^nllF < ||-4(/fn)||F = ||-4(/f^)||F < 7 |I^||f < 7ll^|li- (4-5) 

The second equality follows from A(Hn) + A(Hn^) = 0 . Substitute (14. 5 p into (14.4p and we only need to 
prove the following expression in order to achieve exact recovery, 

(l - HYVII 00 - V 27 H sgn(X 0 ) - TA>||f) ||ifnE|i > 0 . 

If || sgn(X 0 ) — Xn||F < 11^"f 2 -*- l|oo < | and H n ± ^ 0, the above expression is strictly positive. On the 

other hand, Hq± = 0 implies H = 0 from (14. 5p . Therefore, the solution X of SparseLift equals Xq if the 
proposed conditions (14.31) and (14.2p are satisfied. 


4.2 Local isometry property 

In this section, we will prove Lemma 14.31 the local isometry property, which achieves the first condition in 

(H). 

Lemma 4.3. For linear operators A and 4? defined in (I3.4[) and (13.51) respectively, 

\\^n-In\\ = \\A* n An-In\\<S 

holds true with probability at least 1 — L~ a+1 where In is the identity operator on 0 if 


11 

























1. ai is the l-th row of an LxN Gaussian random matrix A and L > G Q /i^ ax /cn max{logL/h 2 , log 2 L/5}. 

2. ai is the l-th row of an L x N random Fourier matrix A and L > CgH ^^ kn log L/5 2 . 

In both cases, C a grows linearly with respect to a and a > 1. 

In fact, it suffices to prove Lemma [4.61 and Lemma [4.71 because Lemma [4.31 is a special case of Lemma [4.71 
when Q = L and P = 1. 


4.2.1 Two ingredients of the proof 


Before we move to the proof of Lemma 14.61 and Lemma 14.71 we introduce two important ingredients which 
will be used later. One ingredient concerns the rows {bi}j l 1 of B. There exists a disjoint partition of index 
set {1, 2, • • • , L} into {r p }p =1 with each |r p | = Q' and Q' > C , /i^ iax A;log(L) such that 


max 

i <p<p 




< 4 £’ where T p = Y1 blb * 

l(zTp 


(4.6) 


The existence of such a partition follows from the proof of Theorem 1.2 in m and is used in Section 3.2 
of pjj. (14.6|1 also can give a simple bound for the operator norms of both T p and Tf l . 


max IIT 

i <p<p ‘ 


5£ 
“ 4L ’ 


max 

i< P <p 




(4.7) 


The other ingredient concerns the non-commutative Bernstein inequality. For Gaussian and Fourier cases, 
the inequalities are slightly different. This following theorem, mostly due to Theorem 6.1 in [36], is used 
for proving the Fourier case of (14.201) . 

Theorem 4.4. Consider a finite sequence of Zi of independent centered random matrices with dimension 
M x M. Assume that \\Zi\\ < R and introduce the random matrix 

s = y. z ‘- ( 4 - 8 ) 

ier p 

Compute the variance parameter 

Q Q 

a 2 = max (|| £ E(Z,Z;)\\, || ^ E(2;2,)||}, (4.9) 

1=1 1=1 

then for all t > 0 

For the Gaussian case we replace \\Zi\\ < R with ||i^< R where the norm || • of a matrix is dehned 
as 

\\ z \Ui ■= inf{E[exp(||Z||/u)] < 2}. 

U>(J 

The concentration inequality is different if ai is Gaussian because of the unboundedness of S. We are 
using the form in [27]. 
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Theorem 4.5. For a finite sequence of independent MxM random matrices Z\ with R := max^^g 
and S and a 2 as defined in (14.81) and (14.171) . we have the tail bound on the operator norm of S, 


Pr(||S|| > t) < 2M exp 


C « a 2 + log (A) Rt J 


(4.10) 


where C 0 is an absolute constant. 


4.2.2 A key lemma and its proof 


For each subset of indices r p , we denote 


Ap, n (X) = {tfXa l>n } 




a: 


■p,n 


u = 


^uibtaln, A^ fl A p ,ii{X) = ^2b l b* l Xa lfl al n (4.11) 




ZETp 


Under certain conditions, it is proven that A* n A p si does not deviate from the identity much for each p. 
We write this result into Lemma [4.61 

Lemma 4.6. For any equal partition {r p }p =1 o/{l,2, ••• , L} with |r p | = Q such that (14 .6 [) holds true 
(e.g. let Q' = Q), we have 


max sup 
1 <P< P ||X n ||F=l 


A’ rfl A p , n(X) 


Q x 

L Xn 



(4.12) 


with probability at least 1 — PL~ a > 1 — L~ a+1 if 

1. ai is the l-th row of an L x N Gaussian random matrix A and Q > Ca/f^^kn log 2 L. 

2. a t is the l-th row of an L x N random Fourier matrix A and Q > C a fj.fi iL ^kn log L. 

In both cases C a grows linearly with respect to a and a > 1. 

First, we prove the following key lemma. Lemma [4.61 is a direct result from Lemma [4.71 and the property 
stated in (14.61) . We postpone the proof of Lemma 14.61 to Section 14.2.31 

Lemma 4.7. For any fixed 0 < S < 1 and equal partition {F p }p =1 o/{l, 2, • • • , L} with |r p | = Q satisfying 
(I4.6p and (14.71) (e.a. let Q' = Q), we have 

max sup ||A; n A Pif2 W - T p X n \\ < ^, T p = ^ b i b *n ( 4 - 13 ) 

1<P<P ||Xn||F=l L ^ 

with probability at least 1 — PL~ a > 1 — L~ a+1 if 

1. ai is the l-th row of an Lx N Gaussian random matrix A and Q > Cafi^^knmaxfiogL/6 2 , log 2 L/8}. 

2. ai is the l-th row of an L x N random Fourier matrix A and Q > CafJ^^kn log L/8 2 . 

In both cases C a grows linearly with respect to a and a > 1. 


Proof: We prove the Gaussian case first; the proof of the Fourier case is similar. Let ep be a standard 
Gaussian random vector, denotes the projected vector on the support of x 0 with | supp(a3 0 )| = n - 
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We first rewrite the linear operator A* p n A p ^{X) — T p X^ into matrix form. Using the Kronecker product 
we define Z] as follows: 

Z>i := i a i,n a * t n ~ In,q) <8> {bib*). (4-14) 

Every Z/ is a centered random matrix since E(a^jja^) = Jjv,n, which follows from restricting a* a* to the 
support fl and taking its expectation. I is an N x N matrix with I n as its n x n leading submatrix. 
It is easy to check that 

vec(^ in ^, n (X) - T p X n ) = ]T Z; vec(Xn). 

ier p 

More precisely, a^n in Z\ needs to be replaced by a^n according to the property of Kronecker product. 
But there is no problem here since each ai is a real random vector. For simplicity, we denote Zi(X) : = 
bib* X (a^a* n — In,q)- I 11 order to use Theorem 14.51 to estimate || J2i=i Zi\\, we need to know and 

a 2 . We first give an estimate for the operator norm of Z[. 


\Zi\\ — \b*bi\\\ai t cia^Q — In,ci\\ < 


hmax^' | 


a i,n a i,n 


-T/v,n|| < 


h'max^ 


max 


M 2 ,i). 


(4.15) 


Here |||| 2 is a x 2 random variable with n degrees of freedom. This implies that 
following from Lemma 6 in [I]. Therefore there exists an absolute constant Cb such that 


R 


max \\Zi\U < 

1 <1<L 


CbvLJvn 

L 


The second step is to estimate <r 2 . It suffices to consider Z*Zi(X) because Zi is a Hermitian matrix. 

Z*Zt{X) = \\bi\\ 2 bib*X(ai^cii^ - I N , a ) 2 . 

The expectation of Z*Zi{X) is 


E{Z*Zi{X)) = {n + l)\\bi\\ 2 bib* XI N ^, 


(4.16) 


following from E(a^nfl^Q — IN,n) 2 — (n +1 )In,ci which can be proven by using similar reasoning of Lemma 

9 in pQ. 


a 2 = 

£ e (z;z,) 

— (n + 1) 



IdzTp 


ier p 


< 


hma M n + X ) 

L 


Y. b ’ b ' 


Shma x Qk(n + 1) 5 n 2 mSLX Qkn 

AL 2 ~ 2 L 2 


(4.17) 

(4.18) 

(4.19) 


where the inequality in the second line follows from \\bi\\ 2 < and the positivity of bib*. 

thing to do is to estimate log (before we can apply the concentration inequality (14.101) . 


that there holds 


log 


(VQR\ 


\ a 2 J 


< C\ log L 


The only 
We claim 
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for some constant C\. Notice that Q < L and R is at most of order L. It suffices to show the lower bound 
of <7 2 in order to prove the expression above. Following from (|4■ 1 7|) . we have a 2 > (n + 1) max/ e p p ||fy|| 4 . 
On the other hand, f|4.6[) leads to Ezer I HI 2 > IIE, e r p bi b*|| > since the trace of a positive serni- 
dehnite matrix must be greater than its operator norm. Therefore we conclude that a 2 > because 

max Zerp ||fy|| 2 > ^ E« s r p INI 2 > Jl and \ T p\ = Q- 

Now we are well prepared to use (I4.10p . Zj can be simply treated as a kn x kn matrix for each l because 
its support is only kn x kn. With R < C b )i 2 max kn/L, a 2 < and log (y/QR/a 2 ) < Ci log L, we have 


Pr 






+ c b 

Q6 2 


l log L , 


kn /2 + Cbdj i^^ kn log L 


Now we let Q > Cafi^^kn m&x{log L/5 2 , log 2 L/S} where C a is a function which grows linearly with 
respect to a and has sufficiently large derivative. By properly assuming L > kn , Q is also greater than 
C 0 ^ kn(l + C b 5 log T) (cr log L + log 2 kn)/5 2 and 


Pr I II 5>'H ^ W I <L 


ie r p 


We take the union bound over all p = 1, 2, ■ ■ ■ , P and have 


Pr max > Zi < 

1 i<p<p 

le r p 


5Q 


> 1 - PL~ a > 1 - L 


— CK+1 


This completes the proof for the Gaussian case. 

We proceed to the Fourier case. The proof is very similar and the key is to use Theorem 14.41 Now we assume 
ai is the l -th row of a random Fourier matrix A and thus E (aid *) = In- Here Z\ = (a^na* n — I at, a) < 8 ) bib* 
because ai is a complex random vector. However, it does not change the proof at all except the notation. 
Like the proof of the Gaussian case, we need to know R = max/ e p \\Z/\ and also a 2 . Following the same 
calculations that yielded (14.15p . (14.16jl and (14.19p . we have 

R< 16*6/1 max(||a; ; n|| 2 ,1) < ^ maxkn 


and 


,,2 u 

2 ^ P'max^' | 


L 

hmax^i 


E b ‘ b ' X - U/,«) 2 || < 

ier p 


T p \\ < 


5hma xQ kn 


R 

max v 

4 L 2 


Then we apply Theorem 14.41 and obtain 


Pr 


Zi\\ > | < 2/cnexp 

le r p 


Q6 2 


5/ima,x W 2 + 2 ^maxW 3 


< L~ 
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if we pick Q > C a ^ ax kn log L/5 2 . This implies 


Pr I max II V ZA\ < — I > 1 - PL~ Q > 1 - L 
1 l<p<P L I 

ier p 


— Ot+l 


4.2.3 Proof of Lemma 14.61 

Proof: From Lemma 14.7[ we know that for a fixed equal partition {r p }T =1 of {1, 2, • • • , L} with |r p | = Q 
and satisfying (14.61) . we have 


max sup 
1<P<P \\Xn\\ F =l 


bitfXnai^a^Q — T p X$ 


l(zT n 


< 


SQ 


(4.20) 


By setting S — | and following from || 6 ; 6 * « ^|| < ^ in (14.61) . we have 


sup 

||^n||F=l 


^btfXnawatn-Qx 


l£T p 


< sup 

||^n||F=l 


< 


btfXnatflaln - T P X f 


ler p 


sup 

||^n||F=l 


Q 


T p X n - fX 


Q_,Q_ = Q_ 

AL AL 2L 


4.2.4 Estimating the operator norm of A 

The operator norm 7 of A with A will be given for both the Gaussian case and the Fourier case such that 
U(X)\\ F < 7 ||X||i? for any X with high probability. For the Gaussian case, Lemma 1 in |Tj gives the 
answer directly. 

Lemma 4.8. For A defined in (13.41) with A a Gaussian random matrix and fixed a > 1 ; 

||.4|| < \/ N\og{NL/2) + ologL (4.21) 


with probability at least 1 — L a . 

For the Fourier case, a direct proof will be given by using the non-commutative Bernstein inequality in 
Theorem 14.41 

Lemma 4.9. For A defined in (13.41) with A a random Fourier matrix and fixed a > 1, 

MU < ^2N(log(2kN) + 1) + 1 (4.22) 

with probability at least 1 — L~ a if we choose L > log(L). 
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Proof: First note that ||*4**4|| = ||*4|| 2 . Therefore it suffices to estimate the norm of *4**4., 


*4**4(X) = btfXcua* = b l b*X{a l a* l - I N ) + X. 


1=1 


1=1 


The second equality follows from the fact that Ylf=i bib* = Ik- For each centered random matrix Zi := 
(did] - I N ) (8) b t b*, we have vec(*4**4(X)) = Ya=i Zivec(X) + vec(X) and 

ii 2 hN 

\\Zi\\ = ||(o z o? - I N ) <g> b,6m = II ^d] - 1*11 ■ \\btfW < 

That means R = maxi<;<n \\(did] — Ijv) < 8 > bib *|| < Mma ^ ,JV . In order to estimate the variance of each Zi, 
we only have to compute E (ZiZ*) since Zi is a Hermitian matrix. 

E (ZiZn < J E [(cud* - I N ) 2 ® b L b*] = 0 6 , 6 *. 


Hence the variance of l 4s 


2 / - 1 ) 


cP < 


Y j i n® bib* 


i=i 


-!) < 


Now we apply Bernstein’s inequality and for any t > 0, we have 


Pr ( Ii y^Zi\\ > t ) < 2kN ■ exp (- J ^ ^ 

l "tr / “ V cr 2 + Rt/3j 


= 2kN ■ exp 


t 2 /2 


/*L x kN/L + fi 2 max kNt/3L 


t 


72_\ 


‘ 2 ^ x kNt/3Lj 
3 Lt 

4 /^L w 


Suppose t >3, then 

Pr ^|| Zi\\ > t^j < 2 kN ■ exp 

= 2 kN ■ exp 

If we pick t = 2N(log(2kN) + 1), then 

Pr f|| Y^ Z iW > t \ < exp 

If L > a/i^ ax fc log(L), we have 

L 

||.4M||<|l5>ll + l < 2N(\og(2kN) + 1) + 1 =► 7 := ||*4|| < ^2N(\og(2kN) + 1) + 1 

i=i 

with probability at least 1 — L~ a . ■ 


„2 u 

rmax^ 
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4.3 Construction of the dual certificate 


In the last section we have shown that (j4.3[) holds true. Now we will prove that aY £ Range (4.*) satisfying 
(14.2p exists by explicitly constructing it for both, the Gaussian and the Fourier case. 


4.3.1 A is a Gaussian random matrix 


Construction of Exact dual certificate via least square method: 

It follows from Lemma fOl and Rj~Fl that is invertible on Q with || (4>^4>n) -1 || < 2 if L > C a kn log 2 L 

for some scalar C a . We construct the exact dual certificate via (see 0) 

p := vec(sgn(X 0 )), q := $*p = vec(A*(p)). (4.23) 

Let Y = A* (p); it suffices to consider q and verify whether q n = vec(sgn(X 0 )) and ||q a x||oo < |. 
Theorem 4.10. Conditioned on the event {||(4»^4»n ) 1 || < 2 }, there holds 

1 

Ilftl-L lloo < ^ 

with probability at least 1 — L~ a if we pick L > 16a/i^ ax fcn(log( L) + log(kN)). 


Proof: Note 4>^ x ;p = ( <f> s t p)i<s<k,t>n and (f) st = diag( 6 s )a( is independent of p for t > n because p only 
uses aj : 1 < j < n. In this case, 

4*8,tP = di ag (b s )p 


where a t is an L x 1 standard Gaussian random vector. Due to the independence between diag (b s )p and 

a t, 

0Lp~AT(O, ||diag(b*)p|| 2 ), 


where 


diag(b*)p || 2 < 
< 


2 2 

^llpll 2 < ^ve C (sgn(X 0 ))*(^# n )-W( S gn(X 0 )) 

^\\ sgn{ x a )r F= 2A f n . 


Therefore, <$> st p is a centered Gaussian random variable with variance a 2 at most 2Mm £ xfc,t . For any x > 0, 
we have 


Pr (l 0 MPl>^) = 


\/2na 2 
2a 

\/2nx 


e 2 ^ du < 


1 ra^ Jx \/27t ax Jx 


ue 2 ^ d u 


-e 2 fj 2 


00 2 a _ x 2 

= \ -e ^ 

x \ TT X 


-sp 

Taking the union bound over all pairs (s,t) of indices (4> st ) satisfying t > n gives 


, . . / 2 a _ kNa __ x 2 

Pr max \<p st p\ > x < kN\l -e 2 ^ < - e 572 

' 1 <s<k,t>n ’ ' 


7T X 


X 
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If we pick x — i, then | < 2y ^^y^ and 


Pr 


~* 1\ fcAhr _V 2J2p% nax knkN -^— 

max </> ,n > - < -e 2 ^ < —--—-e 18 «Lk*». 

l<s<k,t>n' s ’ t l 2 J ~ x ~ y/L 


Choosing L > 16a/i^ iax n/c(log(L) + log (kN)) implies that the probability of {max{ 1 < s < fe)t>r q \<f> si p\ > |} 
will be less than L~ a . ■ 


4.3.2 A is a random Fourier matrix 

Construction of Inexact dual certificate via golfing scheme: 

An inexact dual certificate Y will be constructed to satisfy the conditions given in (14. 2 p via the golfing 
scheme, which was first proposed in [23] and has been widely used since then, see e.g. P dHl EU]. The 
idea is to use part of the measurements in each step and construct a sequence of random matrices Y p 
recursively in Range(A*). The sequence converges to sgn(Wo) exponentially fast and at the same time 
keeps each entry of Y p small on V. We initialize T^o := 0 and set 

Y p := TV! + ^A* p A p ( sgn(Xo) - Y p _ hn ). (4.24) 

We denote Y := Yp. Let W p := Y p _q — sgn(W 0 ) be the residual between sgn(X 0 ) and Y p on 12 and 
Wq ■= Yq — sgn(W 0 ) = — sgn(W 0 ). W p yields the following relations due to (I4.24p . 

w p = Wp-t - ^A;, n Apfl(Wp _,) 

= k (9. _ a; :SI V 

where A Pi n is defined in (14.lip . Following from Lemma 14.61 ||VF p II-F decreases exponentially fast with 
respect to p with probability at least 1 — L~ a+1 and 

\\W P \\ F < i|| W^Wp, \\W p \\ f < 2~ p Vkn. (4.25) 

The value of P can be determined to make ||LFp||^ = ||Vq — sgn(X 0 )||p < Hence, 

2 -Vte < V => P > (4,26) 

“ 4 V 7 “ log 2 V ; 

The next step is to bound the infinity norm of Y on V. We first rewrite the recursive relation (j4.24p : Y 
has the following form, 

t p 

Y = --J2 A ’pApW p . 1 . (4,27) 

^ P= 1 

Our aim is to achieve IWloo < |. It suffices to make ||P^xA*Ap(^p~i)||oo < 2~ P ~ 1 Q in order to make 
IVIloo < Vqa is the operator which orthogonally projects a matrix onto V. For each p, 

EVV^WVOI^VO = T P W P _ r, 
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since W p _ i is known at step p and A*A P is independent of W p _i. T P W P _ 1 has the same support as W p _ i 
does and thus this gives E [P^xM*M p (IT p _i)] = 0. The next theorem proves that \\V^A* p A p {W p _i) < 
2 ~p-iQ_ f or a p 1 < p < P with probability at least 1 — L~ a+1 if L > PQ and Q > C Q /j,‘l lSLX kn(log(kN) + 
log(L)). 

Theorem 4.11. Conditioned on (I4.25P and for fixed p, 

Pr (UTW^WVOIU > < 2k N exp (-^^) • 

In particular this gives 

Pr MlVn^MWp-JWoo < Vl<p<pj>l -P L~ a > 1 - L~ a+ \ 

by choosing Q > 128 M ™ axQ fcn(log(fciV) + log(L)) and taking the union bound for all p. 

Proof: For any z = vec(Z) where Z : k x N has support 12, we have 

A p A p (Z) = <&;<& p z 

directly following from (13.4[i and (13.5p . Therefore, 

||P Q x,4*.A p (Z)||o 0 = max |(e s , Q> p Q> p z)\, 

where e s is a kN x 1 unit vector with its s-th entry equal to 1. Here s > kn because only the first kn 
entries of z are nonzero. The main idea of the proof is to estimate |(e s , 4>*<l> p z)| for each s G 12 - 1 and take 
the union bound over all s G 12^. For any fixed s G 12- 1 , 

( e si %^ P z) = (e s , Y 4>i<f>*z) = £<e s , <t>i = ai® h. 

i&Tp ier p 

Let zi := {e s , 4>i<p*z). Notice E(z;) = 0 because 

E((e s , <pi4>*z)) = e*(/fe (8) bib*)z = e*vec (bitfZ). 

bib*Z and Z have the same support and thus any entry of bib*Z on 12 - 1 equals to 0. Therefore each Zi 
is a bounded and centered random variable. We can use scalar version of Bernstein inequality (a special 
version of Theorem 14.4p to estimate J^ grp zi. First we estimate |z/|, 

M < \e* s <f> l (fi* l z\ = |e>j| ■ \<f>*z\ < ^Itfzl 


There holds |e*0;| < \\ai <g) b;||oo = ||flz||oo||&z||oo < qqq, which follows from the fact that ai is a row of 
DFT matrix and each entry of a,/ is of unit magnitude. Moreover, z has support 12, 


N < 




Vl 




Therefore, R := max; \zf < 


/^max | , * I , h-max,, , ,, ,, ,, , P max 11 

7r 1 - 7r ' 11 11 - 7r" V t w 
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Now we apply the Bernstein inequality in Theorem 14.41 and have the estimate for Yher p z h 

_ 1/2 _\ 

5 ^max<3ll Z H 2 / 4L2 + /^LxV'MMI V 3 W 

Note that th p _i is independent of A*A P and hence the inequality above is still valid if we choose z = 
vec(VF p _i). Since ||z|| < 2~ p+1 \/kn and t = 2~ p ~ 1 /, we have 
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We take the union bound over all s G fW and obtain 
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Pr (ll^^t^OlU > exp (-— 
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Proof of Theorem 


T2 


Now we consider the noisy case by solving the convex program (13.141) . Our goal is to prove Theorem 13.21 
The main ingredient of our proof is Theorem 4.33 in [15] . A variant of that theorem, adapted to our 
setting, reads as follows. 

Theorem 5.1. Let (p s t be the columns of $ G C LxkN , v 0 = vec(X 0 ) G C fcArxl with support on Q and 
y = &v 0 + w with ||iu|| < r/. For 5, /?, 6, 77 , r > 0 and 5 < 1, 

- J n\\ < S, max J </3, (5.1) 

l<s<k,t>n 

and that there exists a vector q = &*p G c kNxl with p G C L such that 

||9n - sgn(v 0 )|| < II^IU < 0, ||p|| < rVkn. (5.2) 

If p : 9 + 4 v ^y_ ( ;) < 1; then the minimizer v to (13.14[) satisfies 

\\v ~ Vq\\ < (Cq + CiT\/kn)r] (5.3) 

where C 0 and Cf are two scalars only depending on 5, (3, 7 , 6 and r. 

Actually, we have already proven in the previous section that we can choose <5 = = |. The only 

two unknown components are r and /3. r follows from computing the norm of p. We need to know the 
coherence /1 of $ to estimate /3 and claim that (3 < 1. Define the coherence as the largest correlation 
between two different columns of the L x kN matrix <E», 

li := max \{^ lujl ^i 2 j 2 )\, 4> s ,t = diag (b s )a t (5.4) 

> 1 / 1 2 0rji^ 2 

where <f> st is defined in (13. 7 j) . 
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5.1 Estimation of fi and (5. 


It is easy to obtain a bound of for 1 < s < k and t > n with /i, 


It suffices to bound p by 1 /Vkn in order to achieve (3 < 1. The following lemma (see Lemma A.9 in US!) 
will be used in the proof. 

Lemma 5.2. For any matrix C, any entry is bounded by its operator norm, i.e., 

max|Cj,-| < ||C||. (5.5) 

i,j 


Theorem 5.3. There holds 

max ,|| < 1 

t>n ’ 

with probability at least 1 — L~ a if 

1. A is a Gaussian random matrix and L > 4C'a/i^ iax max{/en log(L), \fkn log 2 L} and 5 = 2 \og(kN) + 

a. 

2. A is a random Fourier matrix and L > 4C'a/i^ iax /cnlog(L) and a = 2log(kN ) + a. 

Here, in both cases Ca is a function growing at most linearly with respect to a. 


Proof: We only prove the Gaussian case by using Lemma 14.31 The proof of the Fourier case is exactly 
the same. First we pick any two different columns and ( t ) i 2 j 2 f rom where 1 < ii,i 2 < fc and 

1 < ji, J 2 < N. Let be _ 

fi = {(* 1 , jl), (GJ 2 ), (i 2 ,jl), (* 2 , J 2 )}- (5.6) 

|f2| < 4 and contains 4>i 1 j 1 and 4 > i 2 ,j 2 - By Lemma 14731. — 1^|| < 5 with probability 1 — L~ a if 

L > 4 G 5 max{logL/h 2 , log 2 L/S}. Therefore \{4 > i 1 . J1 i </>j 2J - 2 )| < 5 follows from Lemma [5721 Then we take the 
union bound over {kN) 2 /2 pairs of indices {i\,j 1 ) and (* 2 ,^ 2 )- Therefore, ji = max^ ^j, ^ \ (<f>i u j 1 , < f > i 2 ,j 2 )\ — 
1/Vkn with L > AC a max{fcn logL, \fkn log 2 L} and probability at least 1 — ( kN) 2 L~ a > 1 — L~ a if 
5 = 2 log(kN) + a. ■ 


5.2 Proof of Theorem 13.21 when A is a Gaussian random matrix 

Proof: Let p = sgn('Uo) and q = <&*p as defined in (14.231) . From Lemma 14.31 Theorem 

14.101 and Theorem 15.31 we have 

||^ n -In||<«5 = -, 6 = ||g n x||oo < (3 < 1 

with probability at least 1 — 0(L~ a+1 ). 

\\p\\ 2 = sgn(r;o) T (l > ol > o) _1 sgn(i7 0 ) < 2 kn ==► ||p|| < V2kn, r = \Fl 

and also 

„ P 1 1 

p = 9 A - -= -< —|- ■=- < 1 , 

4^7(1 - 5 ) ~ 2 2^27 

where 7 is the operator norm of A and is certainly greater than 1. Applying Theorem 15.11 gives the desired 
result. ■ 
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5.3 Proof of Theorem 13.21 when A is a random Fourier matrix 


Proof: We have an explicit form of p from (14.27ft 


p = ~(A 1 (W 0 ),.-- ,Ap(TFp_i)) t , 

such that q = vec(T), Y = A*(p) = — ^ J2p=i A*A p (W p _i) and Y is an inexact dual certificate. The 
norm of p is 

llp|l 2 = §EllA(vr p -,)||J.. 

i= 1 

For each A p (VF p _i), 


IIA>(WVi)|| 2 = < ||||W P - 1 |||. < ^4-* +l kn, 


L 


which follows from (14.25ft and HA^Ap^|| < in Lemma 14.61 Therefore, 



This implies r = < 1\/‘2P where P is chosen in (|4.26[) . We also have 8 = 9 = \ and /3 < 1 which are the 

same as Gaussian case and p = 9 + 4 v ^ 7 / (i-^) < 1- Now we l iave everything needed to apply Theorem 15.11 
The solution X to (I3.14p satishes 


||v — u 0 ||p = ||X — XoIIf < (Co + Ci'/pVkn)q 1 
with P = 0(log(iV)), and C 0 , C\ are two constants. ■ 


6 Numerical simulations 


6.1 Recovery performance of SparseLift 


We investigate the empirical probability of success of SparseLift over different pairs of ( k , n ) for fixed L 
and N. We fix L = 128, N = 256 and make both k and n range from 1 to 15. We choose B as the first 
k columns of an L x L unitary DFT matrix and A is sampled independently for each experiment, x has 
random supports with cardinality n and the nonzero entries of h and x yield A/"(0,1). For each ( k,n ), 
we run (13.111) 10 times and compute the probability of success. We classify a recovery as a success if the 
relative error is less than 1%, 


IX - X 


OF 


IX 


OF 


< 0.01, X 0 = h 0 XQ. 


Note that SparseLift is implemented by using CVX package [23] . The results are depicted in Figure [Q The 
phase transition plots are similar for the case when A is a Gaussian random matrix or a random Fourier 
matrix. In this example choosing kn < 70 gives satisfactory recovery performance via solving SparseLift. 
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The Frequency of Success: L = 128, N = 256 



2 4 6 8 10 12 14 

n:1 to 15 (Gaussian Case: Performance of Sparselift) 


Frequency of Success: L = 128, N = 256 



2 4 6 8 10 12 14 

n:1 to 15 (Fourier Case: Performance of Sparselift) 


Figure 1: Phase transition plot of performance by solving SparseLift (13. lip directly. 


For comparison, we illustrate the performance when we solve the convex optimization problem that com¬ 
bines the nuclear norm and sparsity as in (I3.10p . see Figures [2] and [3} The values for A are A = 0.1 (the best 
value for A for this experiment) and A = 10. The settings of other parameters are exactly the same as the 
scenario of solving SparseLift. Figure [2] shows that (I3.10p with A = 0.1 performs similar to solving (13. lip . 
However, if A = 10 the results are clearly worse in general, in particular for the Gaussian case, see Figure 
El For Fourier case, choosing A = 10 might do better than SparseLift for some pairs of ( k,n ). The reason 
behind this phenomenon is unclear yet and may be due to solver we choose and numerical imprecision. All 
the numerical results support our hypothesis that the performance of recovery is sensitive to the choice 
of A and SparseLift is good enough to recover calibration parameters and the signal in most cases with 
theoretic guarantee. This is in line with the findings in [32] which imply that convex optimization with 
respect to a combination of the norms || • ||i and || • ||* will not perform significantly better than using only 
one norm, the G-norrn in this case. 

We also show the performance when using a mixed G-G-norm to enforce column-sparsity in X instead 
of just G-norrn. In this case we note a slight improvement over (13. lip . As mentioned earlier, analogous 
versions of our main results, Theorem 13. 1 1 and Theorem 13.21 hold for the mixed G-G-norm case as well and 
will be reported in a forthcoming paper. 


6.2 Minimal L required for exact recovery is proportional to kn 

The minimal L which guarantees exact recovery is nearly proportional to kn , as shown in Figure 0 The 
figure shows the results of two experiments: one is to keep k = 5 and let n vary from 1 to 15; the other is 
to fixe n — 5 and let k range from 1 to 15. In both experiments, N is chosen to be 512, L varies from 10 to 
400 and A is chosen to be a Gaussian random matrix. We run 10 simulations for each set of ( k , n, L ) and 
treat a recovery as a success if the relative error is less than 1%. The result provides numerical evidence 
for the theoretical guarantee that SparseLift yields exact recovery if L > C a kn\og 2 L log(kN). 

We also notice that the ratio of minimal L against kn is slightly larger in the case when n is fixed and k 
varies than in the other case. This difference can be explained by the log(/AV)-term because the larger k 
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The Frequency of Success: L = 128, N = 256 
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Frequency of Success: L = 128, N = 256 


i 



2 4 6 8 10 

n:l to 15 (Gaussian Case and min 



2 4 6 8 10 12 

n:l to 15 (Fourier Case and min || • ||i + 0.1|| ■ ||* solver) 


Figure 2: Phase transition plot of performance via 


+ 


minimization for A = 0.1. 


is, the larger L is needed. 

However, these numerical observations do not imply that the minimum L required by a different (numeri¬ 
cally feasible) algorithm also has to be proportional to kn. Indeed, it would be very desirable to have an 
efficient algorithm with provable recovery guarantees for which on the order of about k + n (modulo some 
log-factors) measurements suffice. 

6.3 Self-calibration and direction-of-arrival estimation 

We examine the array self-calibration problem discussed in Section 11.11 We consider a scenario of three 
fully correlated equal-power signals. Note that the case of fully correlated signals is much harder than 
the case of uncorrelated (or weakly correlated) signals. Standard methods like MUSIC [35j will in general 
fail completely for fully correlated signals even in the absence of any calibration error. The reason is that 
these methods rely on the additional information obtained by taking multiple snapshots. Yet, for fully 
correlated signals all snapshots are essentially identical (neglecting noise) and therefore no new information 
is contained in multiple snapshots. 

We use a circular array consisting of L = 64 omni-directional sensors. We assume that the calibration error 
is changing slowly across sensors, as for example in the case of sensors affected by changing temperature 
or humidity. There are of course many possibilities to model such a smooth change in calibration across 
sensors. In this simulation we assume that the variation can be modeled by a low-degree trigonometric 
polynomial (the periodicity inherent in trigonometric polynomials is fitting in this case, due to the geometry 
of the sensors); in this example we have chosen a polynomial of degree 3. To be precise, the calibration 
error, i.e., the complex gain vector d is given by d = Bh , where B is a 64 x 4 matrix, whose columns 
are the first four columns of the 64 x 64 DFT matrix and h is a vector of length 4, whose entries are 
determined by taking a sample of a multivariate complex Gaussian random variable with zero mean and 
unit covariance matrix. The stepsize for the discretization of the grid of angles for the matrix A is 1 
degree, hence N = 180 (since we consider angles between —90° and 90°). The directions of arrival are 
set to —10°, 5°, and 20°, and the SNR is 25dB. Since the signals are fully correlated we take only one 
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Frequency of Success: L = 128, N = 256 
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Figure 3: Phase transition plot of performance via || • ||i + A|| • ||* minimization for A = 10. 


snapshot. 

We apply SparseLift, cf. (I3.14p . to recover the angles of arrival. Since the sensing matrix A does not fulfill 
the conditions required by Theorem 13.21 and the measurements are noisy, we cannot expect SparseLift to 
return a rank-one matrix. We therefore extract the vector with the angles of arrival by computing the right 
singular vector associated with the largest singular value of X. As can be seen in Figure [6] the estimate 
obtained via self-calibration provides a quite accurate approximation to the true solution. 


6.4 5G, the Internet of Things, and SparseLift 

We briefly discussed in Section 11.21 how self-calibration problems of the form analyzed in this paper are 
anticipated to play a role in 5G wireless communication systems in connection with the Internet of Things, 
sporadic traffic and random access. Let us assume we deal with a CDMA communication scheme. To 
accommodate as many sporadic users as possible, we consider an “overloaded” CDMA system. This 
means that the matrix A containing the so-called spreading sequences (aptly called spreading matrix) has 
many more columns than rows. We use a random Fourier matrix for that purpose. Furthermore, we assume 
that the transmission channel is frequency-selective and time-invariant for the time of transmission of one 
sequence, but it may change from transmission to transmission. The channel is unknown at the receiver, it 
only knows an upper bound on the delay spread (the length of the interval containing the non-zero entries 
of h). Moreover, the transmitted signal experiences additive Gaussian noise. The mathematician who 
is not familiar with all this communications jargon may rest assured that this is a standard scenario in 
wireless communication. In essence it translates to a linear system of equations of the form y = DAx + w, 
where A is a flat random Fourier matrix, D is of the form D = diag (Bh) and w is AWGN. To be specific, 
we choose the transmitted signal x to be of length N = 256 with n = 5 data-bearing coefficients (thus, 
||x||o = 5), the locations of which are chosen at random. The 256 spreading sequences of length L = 128 are 
generated by randomly selecting 128 rows from the 256 x 256 DFT matrix. Thus the system is overloaded 
by a factor of 2. The delay spread is assumed to be 5, i.e., k = 5 and the L x k matrix B consists of the 
first 5 columns of the unitary L x L DFT matrix. The SNR ranges from OdB to 80dB (the latter SNR 
level is admittedly unrealistic in practical scenarios). 
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Figure 4: Phase transition plot of performance via || ■ || li2 minimization. 


We attempt to recover x by solving (13.1211 . Here we assume w { ~ A7(0, a 2 ) and ||iy|| 2 /a 2 ~ xi- We choose 
V = (L + x^L) x ! 2 o as [T] in order to make X 0 inside the feasible set. 

The result is depicted in Figure [71 for comparison we also show the results when using a Gaussian random 
matrix as spreading matrix. We plot the relative error (on a dB scale) for different levels of SNR (also 

on a dB scale). The horizontal axis represents SNR = 101og 10 ^J and the vertical axis represents 

101og 10 (Avg.Err.). We run 10 independent experiments for each SNR level and take the average relative 
error. The error curves show clearly the desirable linear behavior between SNR and mean square error 
with respect to the log-log scale. 


7 Conclusion and outlook 

Self-calibration is an emerging concept that will get increasingly important as hardware and software will 
become more and more intertwined. Several instances of self-calibration, such as blind deconvolution, 
have already been around for some time. In this work we have taken a step toward building a systematic 
mathematical framework for self-calibration. There remain more open questions than answers at this stage. 
Some obvious extensions of the results in this paper will be to consider the case where both x and h are 
sparse. Another useful generalization is the scenario where x is replaced by a matrix, like in the direction- 
of-arrival problem with multiple snapshots. Lifting the self-calibration problem in this case would lead 
to a tensor-valued (instead of a matrix-valued) underdetermined linear system {20j. The computational 
complexity for solving the associated relaxed convex problem will likely be too expensive for practical 
purposes and other methods will be needed. Extensions of the Wirtinger-Flow approach in ra seem to 
be a promising avenue to derive computationally efficient algorithms for this and other self-calibration 
scenarios. Another important generalization is the case when there is mutual coupling between sensors, in 
which case D is no longer a diagonal matrix. Yet another very relevant setting consists of D depending 
non-linearly on h, as is often the case when the sensors are displaced relative to their assumed locations. 
In conclusion, we hope that it has become evident that self-calibration is on the one hand an inevitable 
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Figure 5: Phase transition plot of performance via SparseLift (implemented by using Yalll [44]). Here 
N = 512 and L varies from 10 to 400 for each pair of (k,n). The left figure: k — 5 and n is from 1 to 15; 
the right figure: n = 5 and k is from 1 to 15. 


development in sensor design in order to overcome some serious current limitations, and on the other hand 
it is a rich source of challenging mathematical problems. 
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